Predicting the closing price of a stock is a complex problem for several reasons. Stock prices are influenced by a multitude of factors, such as market trends, and analyzing and incorporating all of these factors accurately into a predictive model is a difficult task. Market volatility makes accurate prediction even harder, and the quality and quantity of the available data impose further limits. Pursuing this problem is nevertheless worthwhile, because accurate stock price predictions have significant implications for investors, financial institutions, and businesses: they can help investors make more informed decisions about buying, selling, or holding stocks, and they aid in managing risk.
In our project, we will use two data mining tasks to help us predict the closing price of a stock: classification and clustering. For classification, we will train a model to classify the close price based on a set of attributes such as volume, open, high, and low. For clustering, we will partition the closing prices into clusters, such that prices within a cluster are similar to one another but dissimilar to prices in other clusters, based on the attributes low, high, open, volume, adjClose, and adjHigh.
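The two tasks can be sketched on a toy data frame as follows (illustrative only: the column names follow the dataset described below, but the made-up values, cut points, and number of clusters are arbitrary choices, not our final settings):

```r
# Toy rows with the same columns as our dataset (values are made up)
toy <- data.frame(open  = c(716, 719, 715, 709),
                  high  = c(722, 723, 717, 709),
                  low   = c(713, 717, 703, 688),
                  close = c(718, 719, 710, 692))

# Classification view: discretize close into labelled bins that a
# classifier could learn to predict from the other attributes
toy$level <- cut(toy$close, breaks = 2, labels = c("low", "high"))

# Clustering view: partition the rows into groups of similar days
km <- kmeans(toy[, c("open", "high", "low")], centers = 2)
km$cluster
```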
Our dataset is from the source: https://www.kaggle.com/datasets/shreenidhihipparagi/google-stock-prediction
Number of Attributes: 14
Number of objects: 1258
Attribute characteristics:
| Attribute Name | Data Type | Description |
| --- | --- | --- |
| symbol | unique value | Name of company |
| date | numeric | date: day, month, and year. |
| close | numeric | closing price of a stock is the final price at which a stock is traded on a given trading day. |
| high | numeric | The highest price at which a stock traded during a specific trading day. |
| low | numeric | The lowest price at which a stock traded during a specific trading day. |
| open | numeric | The price of a stock at the beginning of a trading day. It’s the price at which the first trade occurred on that day. |
| volume | numeric | The total number of shares traded during a trading day. Volume is a measure of market activity and liquidity for a stock. |
| adjClose | numeric | The closing price of a stock adjusted for any corporate actions like dividends, stock splits, or other events that could affect the stock price. |
| adjHigh | numeric | The highest price of a stock during a trading day, adjusted for any corporate actions |
| adjLow | numeric | The lowest price of a stock during a trading day, adjusted for any corporate actions. |
| adjOpen | numeric | The opening price of a stock at the beginning of a trading day, adjusted for any corporate actions. |
| adjVolume | numeric | The trading volume of a stock adjusted for any corporate actions. This can provide a clearer picture of trading activity. |
| divCash | Binary | The amount of money paid by a company to its shareholders as a portion of its profits. Dividends are typically paid on a per-share basis |
| splitFactor | Binary | If a stock undergoes a stock split, the split factor indicates the ratio by which the shares were split. For instance, a 2-for-1 split means that for every old share, you now have 2 new shares. |
# Load necessary packages
if (!require(caret)) {
install.packages("caret")
}
Loading required package: caret
Loading required package: lattice
if (!require(cluster)) {
install.packages("cluster")
}
if (!require(fpc)) {
install.packages("fpc")
}
Loading required package: fpc
Warning: package ‘fpc’ was built under R version 4.3.2
if (!require(ggplot2)) {
install.packages("ggplot2")
}
library(caret)
library(cluster)
library(fpc)
library(ggplot2)
dataset = read.csv('Google.csv')
View(dataset)
print(dataset)
We removed the attributes (symbol, divCash, splitFactor) since each contains only a single value and therefore adds no information.
dataset=dataset[,2:12]
# Convert the date column to Date format
dataset$date <- as.Date(dataset$date, format = "%Y-%m-%d %H:%M:%S")
print(dataset)
str(dataset)
'data.frame': 1258 obs. of 11 variables:
$ date : Date, format: "2016-06-14" ...
$ close : num 718 719 710 692 694 ...
$ high : num 722 723 717 709 702 ...
$ low : num 713 717 703 688 693 ...
$ open : num 716 719 715 709 699 ...
$ volume : int 1306065 1214517 1982471 3402357 2082538 1465634 1184318 2171415 4449022 2641085 ...
$ adjClose : num 718 719 710 692 694 ...
$ adjHigh : num 722 723 717 709 702 ...
$ adjLow : num 713 717 703 688 693 ...
$ adjOpen : num 716 719 715 709 699 ...
$ adjVolume: int 1306065 1214517 1982471 3402357 2082538 1465634 1184318 2171415 4449022 2641085 ...
summary(dataset)
date close high
Min. :2016-06-14 Min. : 668.3 Min. : 672.3
1st Qu.:2017-09-12 1st Qu.: 960.8 1st Qu.: 968.8
Median :2018-12-11 Median :1132.5 Median :1143.9
Mean :2018-12-12 Mean :1216.3 Mean :1227.4
3rd Qu.:2020-03-12 3rd Qu.:1360.6 3rd Qu.:1374.3
Max. :2021-06-11 Max. :2521.6 Max. :2527.0
low open volume adjClose
Min. : 663.3 Min. : 671 Min. : 346753 Min. : 668.3
1st Qu.: 952.2 1st Qu.: 959 1st Qu.:1173522 1st Qu.: 960.8
Median :1117.9 Median :1131 Median :1412588 Median :1132.5
Mean :1204.2 Mean :1215 Mean :1601590 Mean :1216.3
3rd Qu.:1348.6 3rd Qu.:1361 3rd Qu.:1812156 3rd Qu.:1360.6
Max. :2498.3 Max. :2525 Max. :6207027 Max. :2521.6
adjHigh adjLow adjOpen adjVolume
Min. : 672.3 Min. : 663.3 Min. : 671 Min. : 346753
1st Qu.: 968.8 1st Qu.: 952.2 1st Qu.: 959 1st Qu.:1173522
Median :1143.9 Median :1117.9 Median :1131 Median :1412588
Mean :1227.4 Mean :1204.2 Mean :1215 Mean :1601590
3rd Qu.:1374.3 3rd Qu.:1348.6 3rd Qu.:1361 3rd Qu.:1812156
Max. :2527.0 Max. :2498.3 Max. :2525 Max. :6207027
Mean of closing price. The mean closing price is the average price at which a stock has closed over a specific period; it can serve as a basic reference point or simple benchmark when forecasting future stock prices.
mean(dataset$close)
[1] 1216.317
Variance
The concept of variance in the context of closing prices for stock prediction serves to quantify the spread or dispersion of the closing prices around their mean or average value. It provides a measure of how much the actual closing prices deviate from the average closing price over a specific period.
var(dataset$close)
[1] 146944.5
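Because the variance is expressed in squared price units, its square root, the standard deviation, is often easier to interpret, since it is in the same units as the price itself (a small addition, assuming `dataset` is loaded as above):

```r
# Standard deviation of the closing price, in the same units as the price
sd(dataset$close)   # equivalently sqrt(var(dataset$close)), about 383.3
```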
Summaries for all numeric attributes and their outliers and boxplots.
#statistical measures
#summaries
summary(dataset$close)
Min. 1st Qu. Median Mean 3rd Qu. Max.
668.3 960.8 1132.5 1216.3 1360.6 2521.6
summary(dataset$high)
Min. 1st Qu. Median Mean 3rd Qu. Max.
672.3 968.8 1143.9 1227.4 1374.3 2527.0
summary(dataset$low)
Min. 1st Qu. Median Mean 3rd Qu. Max.
663.3 952.2 1117.9 1204.2 1348.6 2498.3
summary(dataset$open)
Min. 1st Qu. Median Mean 3rd Qu. Max.
671 959 1131 1215 1361 2525
summary(dataset$volume)
Min. 1st Qu. Median Mean 3rd Qu. Max.
346753 1173522 1412588 1601590 1812156 6207027
summary(dataset$adjClose)
Min. 1st Qu. Median Mean 3rd Qu. Max.
668.3 960.8 1132.5 1216.3 1360.6 2521.6
summary(dataset$adjHigh)
Min. 1st Qu. Median Mean 3rd Qu. Max.
672.3 968.8 1143.9 1227.4 1374.3 2527.0
summary(dataset$adjLow)
Min. 1st Qu. Median Mean 3rd Qu. Max.
663.3 952.2 1117.9 1204.2 1348.6 2498.3
summary(dataset$adjOpen)
Min. 1st Qu. Median Mean 3rd Qu. Max.
671 959 1131 1215 1361 2525
summary(dataset$adjVolume)
Min. 1st Qu. Median Mean 3rd Qu. Max.
346753 1173522 1412588 1601590 1812156 6207027
#outliers
boxplot.stats(dataset$close)$out
[1] 2070.07 2062.37 2098.00 2092.91 2083.51 2095.38 2095.89 2104.11
[9] 2121.90 2128.31 2117.20 2101.14 2064.88 2070.86 2095.17 2031.36
[17] 2036.86 2081.51 2075.84 2026.71 2049.09 2108.54 2024.17 2052.70
[25] 2055.03 2114.77 2061.92 2066.49 2092.52 2091.08 2036.22 2043.20
[33] 2038.59 2052.96 2045.06 2044.36 2035.55 2055.95 2055.54 2068.63
[41] 2137.75 2225.55 2224.75 2249.68 2265.44 2285.88 2254.79 2267.27
[49] 2254.84 2296.66 2297.76 2302.40 2293.63 2293.29 2267.92 2315.30
[57] 2326.74 2307.12 2379.91 2429.89 2410.12 2395.17 2354.25 2356.74
[65] 2381.35 2398.69 2341.66 2308.76 2239.08 2261.97 2316.16 2321.41
[73] 2303.43 2308.71 2356.09 2345.10 2406.67 2409.07 2433.53 2402.51
[81] 2411.56 2429.81 2421.28 2404.61 2451.76 2466.09 2482.85 2491.40
[89] 2521.60 2513.93
boxplot.stats(dataset$high)$out
[1] 2116.500 2078.550 2102.510 2123.547 2105.130 2108.370 2102.030
[8] 2108.820 2152.680 2133.660 2132.735 2130.530 2091.420 2082.010
[15] 2100.780 2094.880 2071.010 2086.520 2104.370 2088.518 2089.240
[22] 2118.110 2128.810 2078.040 2075.000 2125.700 2090.260 2067.060
[29] 2123.560 2109.780 2075.500 2053.100 2057.990 2072.302 2078.210
[36] 2058.870 2050.990 2058.430 2070.780 2093.327 2142.940 2237.310
[43] 2237.660 2255.000 2284.005 2289.040 2275.320 2277.210 2277.990
[50] 2306.597 2306.440 2318.450 2309.600 2295.320 2303.762 2325.820
[57] 2341.260 2337.450 2452.378 2436.520 2427.140 2419.700 2379.260
[64] 2382.200 2382.710 2416.410 2378.000 2322.000 2285.370 2276.601
[71] 2321.140 2323.340 2343.150 2316.760 2360.340 2369.000 2418.480
[78] 2432.890 2442.944 2440.000 2428.140 2437.971 2442.000 2409.745
[85] 2453.859 2468.000 2494.495 2505.000 2523.260 2526.990
boxplot.stats(dataset$low)$out
[1] 2018.380 2042.590 2059.330 2072.000 2078.540 2063.090 2077.320
[8] 2083.130 2104.360 2098.920 2103.710 2097.410 2062.140 2002.020
[15] 2038.130 2021.290 2016.060 2046.100 2071.260 2010.000 2020.270
[22] 2046.415 2021.610 2047.830 2033.370 2072.380 2047.550 2043.510
[29] 2070.000 2054.000 2033.550 2017.680 2026.070 2039.220 2041.555
[36] 2010.730 2014.020 2015.620 2044.030 2056.745 2096.890 2151.620
[43] 2214.800 2225.330 2257.680 2253.714 2238.465 2256.090 2249.190
[50] 2266.000 2284.450 2287.845 2271.710 2258.570 2256.450 2278.210
[57] 2313.840 2304.270 2374.850 2402.280 2402.160 2384.500 2311.700
[64] 2351.410 2342.338 2390.000 2334.730 2283.000 2230.050 2242.720
[71] 2283.320 2295.000 2303.160 2263.520 2321.090 2342.370 2360.110
[78] 2402.990 2412.515 2402.000 2407.690 2404.880 2404.200 2382.830
[85] 2417.770 2441.073 2468.240 2487.330 2494.000 2498.290
boxplot.stats(dataset$open)$out
[1] 2073.000 2068.890 2070.000 2105.910 2078.540 2094.210 2099.510
[8] 2090.250 2104.360 2100.000 2110.390 2119.270 2067.000 2025.010
[15] 2041.830 2067.450 2050.520 2056.520 2076.190 2067.210 2023.370
[22] 2073.120 2101.130 2070.000 2071.760 2074.060 2085.000 2062.300
[29] 2078.990 2076.030 2061.000 2042.050 2041.840 2051.700 2065.370
[36] 2044.810 2038.860 2027.880 2057.630 2059.120 2097.950 2152.940
[43] 2222.500 2226.130 2277.960 2256.700 2266.250 2261.470 2275.160
[50] 2276.980 2303.000 2291.980 2307.890 2285.250 2293.230 2283.470
[57] 2319.930 2336.000 2407.145 2410.330 2404.490 2402.720 2369.740
[64] 2368.420 2350.640 2400.000 2374.890 2291.860 2261.710 2261.090
[71] 2291.830 2309.320 2336.906 2264.400 2328.040 2365.990 2367.000
[78] 2420.000 2412.835 2436.940 2421.960 2422.000 2435.310 2395.020
[85] 2422.520 2451.320 2479.900 2499.500 2494.010 2524.920
boxplot.stats(dataset$volume)$out
[1] 3402357 4449022 3530169 3841482 4269902 4745183 3654385 3017947
[9] 2973891 2965771 3246573 3487056 3160585 3270248 3731589 2921393
[17] 3248393 4626086 3095263 5125791 3142760 4758496 3336352 3360727
[25] 3267883 3029471 3369275 4760260 3088305 3318204 4405584 2950120
[33] 4187586 3880723 3212657 4595891 3552194 6207027 5130576 2833483
[41] 4805752 3316905 3055216 3932954 2867053 2978300 3790618 3365365
[49] 4226748 3700125 4252365 3861489 4233435 3651106 3601750 4044137
[57] 3344450 4081528 3573755 3208495 2951309 3793630 3157875 4267698
[65] 3429036 3581072 3107763 3103882 2888827 4330862 3570927 4016353
[73] 4118170 2986439
boxplot.stats(dataset$adjClose)$out
[1] 2070.07 2062.37 2098.00 2092.91 2083.51 2095.38 2095.89 2104.11
[9] 2121.90 2128.31 2117.20 2101.14 2064.88 2070.86 2095.17 2031.36
[17] 2036.86 2081.51 2075.84 2026.71 2049.09 2108.54 2024.17 2052.70
[25] 2055.03 2114.77 2061.92 2066.49 2092.52 2091.08 2036.22 2043.20
[33] 2038.59 2052.96 2045.06 2044.36 2035.55 2055.95 2055.54 2068.63
[41] 2137.75 2225.55 2224.75 2249.68 2265.44 2285.88 2254.79 2267.27
[49] 2254.84 2296.66 2297.76 2302.40 2293.63 2293.29 2267.92 2315.30
[57] 2326.74 2307.12 2379.91 2429.89 2410.12 2395.17 2354.25 2356.74
[65] 2381.35 2398.69 2341.66 2308.76 2239.08 2261.97 2316.16 2321.41
[73] 2303.43 2308.71 2356.09 2345.10 2406.67 2409.07 2433.53 2402.51
[81] 2411.56 2429.81 2421.28 2404.61 2451.76 2466.09 2482.85 2491.40
[89] 2521.60 2513.93
boxplot.stats(dataset$adjHigh)$out
[1] 2116.500 2078.550 2102.510 2123.547 2105.130 2108.370 2102.030
[8] 2108.820 2152.680 2133.660 2132.735 2130.530 2091.420 2082.010
[15] 2100.780 2094.880 2071.010 2086.520 2104.370 2088.518 2089.240
[22] 2118.110 2128.810 2078.040 2075.000 2125.700 2090.260 2067.060
[29] 2123.560 2109.780 2075.500 2053.100 2057.990 2072.302 2078.210
[36] 2058.870 2050.990 2058.430 2070.780 2093.327 2142.940 2237.310
[43] 2237.660 2255.000 2284.005 2289.040 2275.320 2277.210 2277.990
[50] 2306.597 2306.440 2318.450 2309.600 2295.320 2303.762 2325.820
[57] 2341.260 2337.450 2452.378 2436.520 2427.140 2419.700 2379.260
[64] 2382.200 2382.710 2416.410 2378.000 2322.000 2285.370 2276.601
[71] 2321.140 2323.340 2343.150 2316.760 2360.340 2369.000 2418.480
[78] 2432.890 2442.944 2440.000 2428.140 2437.971 2442.000 2409.745
[85] 2453.859 2468.000 2494.495 2505.000 2523.260 2526.990
boxplot.stats(dataset$adjLow)$out
[1] 2018.380 2042.590 2059.330 2072.000 2078.540 2063.090 2077.320
[8] 2083.130 2104.360 2098.920 2103.710 2097.410 2062.140 2002.020
[15] 2038.130 2021.290 2016.060 2046.100 2071.260 2010.000 2020.270
[22] 2046.415 2021.610 2047.830 2033.370 2072.380 2047.550 2043.510
[29] 2070.000 2054.000 2033.550 2017.680 2026.070 2039.220 2041.555
[36] 2010.730 2014.020 2015.620 2044.030 2056.745 2096.890 2151.620
[43] 2214.800 2225.330 2257.680 2253.714 2238.465 2256.090 2249.190
[50] 2266.000 2284.450 2287.845 2271.710 2258.570 2256.450 2278.210
[57] 2313.840 2304.270 2374.850 2402.280 2402.160 2384.500 2311.700
[64] 2351.410 2342.338 2390.000 2334.730 2283.000 2230.050 2242.720
[71] 2283.320 2295.000 2303.160 2263.520 2321.090 2342.370 2360.110
[78] 2402.990 2412.515 2402.000 2407.690 2404.880 2404.200 2382.830
[85] 2417.770 2441.073 2468.240 2487.330 2494.000 2498.290
boxplot.stats(dataset$adjOpen)$out
[1] 2073.000 2068.890 2070.000 2105.910 2078.540 2094.210 2099.510
[8] 2090.250 2104.360 2100.000 2110.390 2119.270 2067.000 2025.010
[15] 2041.830 2067.450 2050.520 2056.520 2076.190 2067.210 2023.370
[22] 2073.120 2101.130 2070.000 2071.760 2074.060 2085.000 2062.300
[29] 2078.990 2076.030 2061.000 2042.050 2041.840 2051.700 2065.370
[36] 2044.810 2038.860 2027.880 2057.630 2059.120 2097.950 2152.940
[43] 2222.500 2226.130 2277.960 2256.700 2266.250 2261.470 2275.160
[50] 2276.980 2303.000 2291.980 2307.890 2285.250 2293.230 2283.470
[57] 2319.930 2336.000 2407.145 2410.330 2404.490 2402.720 2369.740
[64] 2368.420 2350.640 2400.000 2374.890 2291.860 2261.710 2261.090
[71] 2291.830 2309.320 2336.906 2264.400 2328.040 2365.990 2367.000
[78] 2420.000 2412.835 2436.940 2421.960 2422.000 2435.310 2395.020
[85] 2422.520 2451.320 2479.900 2499.500 2494.010 2524.920
boxplot.stats(dataset$adjVolume)$out
[1] 3402357 4449022 3530169 3841482 4269902 4745183 3654385 3017947
[9] 2973891 2965771 3246573 3487056 3160585 3270248 3731589 2921393
[17] 3248393 4626086 3095263 5125791 3142760 4758496 3336352 3360727
[25] 3267883 3029471 3369275 4760260 3088305 3318204 4405584 2950120
[33] 4187586 3880723 3212657 4595891 3552194 6207027 5130576 2833483
[41] 4805752 3316905 3055216 3932954 2867053 2978300 3790618 3365365
[49] 4226748 3700125 4252365 3861489 4233435 3651106 3601750 4044137
[57] 3344450 4081528 3573755 3208495 2951309 3793630 3157875 4267698
[65] 3429036 3581072 3107763 3103882 2888827 4330862 3570927 4016353
[73] 4118170 2986439
#boxplots
boxplot(dataset$close)
boxplot(dataset$high)
boxplot(dataset$low)
boxplot(dataset$open)
boxplot(dataset$volume)
boxplot(dataset$adjClose)
boxplot(dataset$adjHigh)
boxplot(dataset$adjLow)
boxplot(dataset$adjOpen)
boxplot(dataset$adjVolume)
Plotting methods
Scatter Plot
This scatter plot helps us determine whether the closing price and volume are correlated. The points show no clear pattern, suggesting only a weak relationship between the two attributes, which is consistent with the low close/volume correlation in the correlation matrix computed later.
with(dataset, plot(volume, close))
The bar plot shows the closing price against the date. It indicates how the closing price at the end of each traded day rises and falls over the period covered by the dataset.
barplot(height = dataset$close, names.arg = dataset$date, xlab = "Date", ylab = "Closing price", main = "date vs Close")
This histogram shows the frequency distribution of the stock's closing price. We observe that most values lie between 1000 and 1200.
hist(dataset$close)
Here is our data set before preprocessing
#dataset before preprocessing
print(dataset)
Data cleaning, including handling missing values such as NULLs, is crucial before using data for analysis or modeling. Missing or incorrect data can skew the analysis and lead to inaccurate insights or predictions, while clean data ensures the reliability of the findings and reduces the risk of making decisions based on flawed information.
To find the total number of null values in the dataset, we check every cell: FALSE means the value is not null, TRUE means it is null.
is.na(dataset)
date close high low open volume adjClose adjHigh adjLow adjOpen adjVolume
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[ every cell in all 1258 rows is FALSE -- output truncated ]
sum(is.na(dataset))
[1] 0
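A per-column count is a more compact check than the full logical matrix (a sketch, assuming `dataset` is loaded as above): `colSums()` adds up the TRUE cells in each column, so a vector of zeros confirms that no attribute has missing values.

```r
# Missing values per column; all zeros means no NULLs anywhere
colSums(is.na(dataset))
```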
print("Since there are no NULL values, we don't need to remove any rows")
[1] "Since there are no NULL values, we don't need to remove any rows"
Since there are no NULL values in our data, we don't need to remove any rows.
Since most attributes in our dataset are numeric and removing every outlier would affect our calculations and predictions, we will remove only the outliers of the closing price and the volume.
#dataset before removing outliers
print(dataset)
summary(dataset)
date close high
Min. :2016-06-14 Min. : 668.3 Min. : 672.3
1st Qu.:2017-09-12 1st Qu.: 960.8 1st Qu.: 968.8
Median :2018-12-11 Median :1132.5 Median :1143.9
Mean :2018-12-12 Mean :1216.3 Mean :1227.4
3rd Qu.:2020-03-12 3rd Qu.:1360.6 3rd Qu.:1374.3
Max. :2021-06-11 Max. :2521.6 Max. :2527.0
low open volume adjClose
Min. : 663.3 Min. : 671 Min. : 346753 Min. : 668.3
1st Qu.: 952.2 1st Qu.: 959 1st Qu.:1173522 1st Qu.: 960.8
Median :1117.9 Median :1131 Median :1412588 Median :1132.5
Mean :1204.2 Mean :1215 Mean :1601590 Mean :1216.3
3rd Qu.:1348.6 3rd Qu.:1361 3rd Qu.:1812156 3rd Qu.:1360.6
Max. :2498.3 Max. :2525 Max. :6207027 Max. :2521.6
adjHigh adjLow adjOpen adjVolume
Min. : 672.3 Min. : 663.3 Min. : 671 Min. : 346753
1st Qu.: 968.8 1st Qu.: 952.2 1st Qu.: 959 1st Qu.:1173522
Median :1143.9 Median :1117.9 Median :1131 Median :1412588
Mean :1227.4 Mean :1204.2 Mean :1215 Mean :1601590
3rd Qu.:1374.3 3rd Qu.:1348.6 3rd Qu.:1361 3rd Qu.:1812156
Max. :2527.0 Max. :2498.3 Max. :2525 Max. :6207027
str(dataset)
'data.frame': 1258 obs. of 11 variables:
$ date : Date, format: "2016-06-14" ...
$ close : num 718 719 710 692 694 ...
$ high : num 722 723 717 709 702 ...
$ low : num 713 717 703 688 693 ...
$ open : num 716 719 715 709 699 ...
$ volume : int 1306065 1214517 1982471 3402357 2082538 1465634 1184318 2171415 4449022 2641085 ...
$ adjClose : num 718 719 710 692 694 ...
$ adjHigh : num 722 723 717 709 702 ...
$ adjLow : num 713 717 703 688 693 ...
$ adjOpen : num 716 719 715 709 699 ...
$ adjVolume: int 1306065 1214517 1982471 3402357 2082538 1465634 1184318 2171415 4449022 2641085 ...
#removing close outliers
outliers <- boxplot(dataset$close, plot=FALSE)$out
dataset <- dataset[-which(dataset$close %in% outliers),]
boxplot.stats(dataset$close)$out
[1] 1749.13 1763.37 1761.75 1763.00 1752.71 1749.84 1777.02 1781.38
[9] 1770.15 1746.78 1763.92 1768.88 1771.43 1793.19 1760.74 1798.10
[17] 1827.95 1826.77 1827.99 1819.48 1818.55 1784.13 1775.33 1781.77
[25] 1760.06 1767.77 1763.00 1747.90 1776.09 1758.72 1751.88 1787.25
[33] 1807.21 1766.72 1746.55 1754.40 1790.86 1886.90 1891.25 1901.05
[41] 1899.40 1917.24 1830.79 1863.11 1835.74 1901.35 1927.51
#removing volume outliers
outliers <- boxplot(dataset$volume, plot=FALSE)$out
dataset <- dataset[-which(dataset$volume %in% outliers),]
boxplot.stats(dataset$volume)$out
[1] 2641085 2700470 2749221 2607121 2553771 2712222 2634669 2720942
[9] 2560277 2580374 2558385 2726830 2680400 2619234 2675742 2580612
[17] 2769225 2673464 2576470 2642983 2597455 2561288 2660628 2611373
[25] 2611229 2574061 2664723 2668906 2608568 2610884 2568345 2636142
[33] 2602114 2748292
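Note that after a pass of removal, boxplot.stats flags a new set of points, as the outputs above show: dropping the original outliers shrinks the quartiles, which tightens the 1.5 * IQR fences. If a fully outlier-free column were ever required, the removal could be repeated until nothing is flagged; a minimal sketch (not part of our pipeline, since we deliberately stop after one pass):

```r
# Repeatedly drop 1.5 * IQR outliers from one column until none remain.
# Each pass narrows the fences, so several iterations may be needed.
removeAllOutliers <- function(df, column) {
  repeat {
    out <- boxplot.stats(df[[column]])$out
    if (length(out) == 0) break
    df <- df[!(df[[column]] %in% out), ]
  }
  df
}
```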
#data set after removing outliers
print(dataset)
summary(dataset)
date close high
Min. :2016-06-14 Min. : 668.3 Min. : 672.3
1st Qu.:2017-08-10 1st Qu.: 942.2 1st Qu.: 943.8
Median :2018-09-29 Median :1115.7 Median :1125.6
Mean :2018-10-03 Mean :1139.9 Mean :1149.5
3rd Qu.:2019-11-14 3rd Qu.:1264.7 3rd Qu.:1275.7
Max. :2021-02-02 Max. :1927.5 Max. :1955.8
low open volume
Min. : 663.3 Min. : 671.0 Min. : 346753
1st Qu.: 933.8 1st Qu.: 939.7 1st Qu.:1167344
Median :1104.2 Median :1115.8 Median :1394116
Mean :1129.1 Mean :1138.6 Mean :1480717
3rd Qu.:1251.1 3rd Qu.:1262.0 3rd Qu.:1719968
Max. :1914.5 Max. :1922.6 Max. :2769225
adjClose adjHigh adjLow
Min. : 668.3 Min. : 672.3 Min. : 663.3
1st Qu.: 942.2 1st Qu.: 943.8 1st Qu.: 933.8
Median :1115.7 Median :1125.6 Median :1104.2
Mean :1139.9 Mean :1149.5 Mean :1129.1
3rd Qu.:1264.7 3rd Qu.:1275.7 3rd Qu.:1251.1
Max. :1927.5 Max. :1955.8 Max. :1914.5
adjOpen adjVolume
Min. : 671.0 Min. : 346753
1st Qu.: 939.7 1st Qu.:1167344
Median :1115.8 Median :1394116
Mean :1138.6 Mean :1480717
3rd Qu.:1262.0 3rd Qu.:1719968
Max. :1922.6 Max. :2769225
str(dataset)
'data.frame': 1096 obs. of 11 variables:
$ date : Date, format: "2016-06-14" ...
$ close : num 718 719 710 694 696 ...
$ high : num 722 723 717 702 703 ...
$ low : num 713 717 703 693 692 ...
$ open : num 716 719 715 699 698 ...
$ volume : int 1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...
$ adjClose : num 718 719 710 694 696 ...
$ adjHigh : num 722 723 717 702 703 ...
$ adjLow : num 713 717 703 693 692 ...
$ adjOpen : num 716 719 715 699 698 ...
$ adjVolume: int 1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...
Feature selection
Remove Redundant Features
# load the library
library(mlbench)
Warning: package ‘mlbench’ was built under R version 4.3.2
library(caret)
library(ggplot2)
library(lattice)
# calculate correlation matrix
correlationMatrix <- cor(dataset[,2:11])
# summarize the correlation matrix
print(correlationMatrix)
close high low open volume
close 1.0000000 0.9993759 0.9994124 0.9986066 0.1155092
high 0.9993759 1.0000000 0.9992333 0.9993994 0.1278230
low 0.9994124 0.9992333 1.0000000 0.9993082 0.1038372
open 0.9986066 0.9993994 0.9993082 1.0000000 0.1177215
volume 0.1155092 0.1278230 0.1038372 0.1177215 1.0000000
adjClose 1.0000000 0.9993759 0.9994124 0.9986066 0.1155092
adjHigh 0.9993759 1.0000000 0.9992333 0.9993994 0.1278230
adjLow 0.9994124 0.9992333 1.0000000 0.9993082 0.1038372
adjOpen 0.9986066 0.9993994 0.9993082 1.0000000 0.1177215
adjVolume 0.1155092 0.1278230 0.1038372 0.1177215 1.0000000
adjClose adjHigh adjLow adjOpen adjVolume
close 1.0000000 0.9993759 0.9994124 0.9986066 0.1155092
high 0.9993759 1.0000000 0.9992333 0.9993994 0.1278230
low 0.9994124 0.9992333 1.0000000 0.9993082 0.1038372
open 0.9986066 0.9993994 0.9993082 1.0000000 0.1177215
volume 0.1155092 0.1278230 0.1038372 0.1177215 1.0000000
adjClose 1.0000000 0.9993759 0.9994124 0.9986066 0.1155092
adjHigh 0.9993759 1.0000000 0.9992333 0.9993994 0.1278230
adjLow 0.9994124 0.9992333 1.0000000 0.9993082 0.1038372
adjOpen 0.9986066 0.9993994 0.9993082 1.0000000 0.1177215
adjVolume 0.1155092 0.1278230 0.1038372 0.1177215 1.0000000
# find attributes that are highly correlated with each other (cutoff = 0.5)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)
# print indexes of highly correlated attributes
print(highlyCorrelated)
[1] 7 2 4 9 1 6 8 5
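The indexes printed above refer to columns of the correlation matrix, which was computed on dataset[, 2:11], so they are offset by one from the full data frame (date is column 1). One way to drop the flagged redundant columns is sketched below; whether to drop them, and the `reduced` name, are illustrative choices rather than a step this report performs:

```r
# Shift matrix indexes by one to account for the leading date column,
# then keep date plus the columns not flagged as redundant
colsToDrop <- highlyCorrelated + 1
reduced <- dataset[, -colsToDrop]
str(reduced)
```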
dataset before normalization
#dataset before normalization
print(dataset)
summary(dataset)
date close high
Min. :2016-06-14 Min. : 668.3 Min. : 672.3
1st Qu.:2017-08-10 1st Qu.: 942.2 1st Qu.: 943.8
Median :2018-09-29 Median :1115.7 Median :1125.6
Mean :2018-10-03 Mean :1139.9 Mean :1149.5
3rd Qu.:2019-11-14 3rd Qu.:1264.7 3rd Qu.:1275.7
Max. :2021-02-02 Max. :1927.5 Max. :1955.8
low open volume
Min. : 663.3 Min. : 671.0 Min. : 346753
1st Qu.: 933.8 1st Qu.: 939.7 1st Qu.:1167344
Median :1104.2 Median :1115.8 Median :1394116
Mean :1129.1 Mean :1138.6 Mean :1480717
3rd Qu.:1251.1 3rd Qu.:1262.0 3rd Qu.:1719968
Max. :1914.5 Max. :1922.6 Max. :2769225
adjClose adjHigh adjLow
Min. : 668.3 Min. : 672.3 Min. : 663.3
1st Qu.: 942.2 1st Qu.: 943.8 1st Qu.: 933.8
Median :1115.7 Median :1125.6 Median :1104.2
Mean :1139.9 Mean :1149.5 Mean :1129.1
3rd Qu.:1264.7 3rd Qu.:1275.7 3rd Qu.:1251.1
Max. :1927.5 Max. :1955.8 Max. :1914.5
adjOpen adjVolume
Min. : 671.0 Min. : 346753
1st Qu.: 939.7 1st Qu.:1167344
Median :1115.8 Median :1394116
Mean :1138.6 Mean :1480717
3rd Qu.:1262.0 3rd Qu.:1719968
Max. :1922.6 Max. :2769225
str(dataset)
'data.frame': 1096 obs. of 11 variables:
$ date : Date, format: "2016-06-14" ...
$ close : num 718 719 710 694 696 ...
$ high : num 722 723 717 702 703 ...
$ low : num 713 717 703 693 692 ...
$ open : num 716 719 715 699 698 ...
$ volume : int 1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...
$ adjClose : num 718 719 710 694 696 ...
$ adjHigh : num 722 723 717 702 703 ...
$ adjLow : num 713 717 703 693 692 ...
$ adjOpen : num 716 719 715 699 698 ...
$ adjVolume: int 1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...
Normalization was performed to ensure consistent scaling of the data. The technique applied was min-max normalization, which rescales the values of the selected attributes into a defined range between 0 and 1.
The normalized dataset provides a more uniform and comparable representation of the attributes, enabling more accurate analysis and modeling for stock prediction, with the result shown below.
normalize <- function(x) {return ((x - min(x)) / (max(x) - min(x)))}
dataWithoutNormalization <- dataset
dataset$close<-normalize(dataWithoutNormalization$close)
dataset$volume<-normalize(dataWithoutNormalization$volume)
dataset$open<-normalize(dataWithoutNormalization$open)
dataset$low <-normalize(dataWithoutNormalization$low)
dataset$high <-normalize(dataWithoutNormalization$high)
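As a sanity check, min-max normalization maps the column minimum to 0 and the maximum to 1. A minimal sketch on a few illustrative closing prices (toy values, not the full dataset):

```r
# min-max normalization: rescales x into [0, 1]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))

prices <- c(668.3, 942.2, 1115.7, 1927.5)  # illustrative closing prices
scaled <- normalize(prices)
round(scaled, 4)  # → 0.0000 0.2175 0.3553 1.0000
```

The minimum always maps to exactly 0 and the maximum to exactly 1, which is why the summary of every normalized column below shows Min. 0 and Max. 1.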
dataset after normalization
#dataset after normalization
print(dataset)
summary(dataset)
date close high
Min. :2016-06-14 Min. :0.0000 Min. :0.0000
1st Qu.:2017-08-10 1st Qu.:0.2175 1st Qu.:0.2115
Median :2018-09-29 Median :0.3553 Median :0.3532
Mean :2018-10-03 Mean :0.3746 Mean :0.3718
3rd Qu.:2019-11-14 3rd Qu.:0.4736 3rd Qu.:0.4701
Max. :2021-02-02 Max. :1.0000 Max. :1.0000
low open volume
Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.2162 1st Qu.:0.2147 1st Qu.:0.3387
Median :0.3524 Median :0.3554 Median :0.4324
Mean :0.3723 Mean :0.3736 Mean :0.4681
3rd Qu.:0.4698 3rd Qu.:0.4722 3rd Qu.:0.5669
Max. :1.0000 Max. :1.0000 Max. :1.0000
adjClose adjHigh adjLow
Min. : 668.3 Min. : 672.3 Min. : 663.3
1st Qu.: 942.2 1st Qu.: 943.8 1st Qu.: 933.8
Median :1115.7 Median :1125.6 Median :1104.2
Mean :1139.9 Mean :1149.5 Mean :1129.1
3rd Qu.:1264.7 3rd Qu.:1275.7 3rd Qu.:1251.1
Max. :1927.5 Max. :1955.8 Max. :1914.5
adjOpen adjVolume
Min. : 671.0 Min. : 346753
1st Qu.: 939.7 1st Qu.:1167344
Median :1115.8 Median :1394116
Mean :1138.6 Mean :1480717
3rd Qu.:1262.0 3rd Qu.:1719968
Max. :1922.6 Max. :2769225
str(dataset)
'data.frame': 1096 obs. of 11 variables:
$ date : Date, format: "2016-06-14" ...
$ close : num 0.0397 0.0402 0.0334 0.0202 0.022 ...
$ high : num 0.0391 0.0395 0.0346 0.0235 0.0237 ...
$ low : num 0.0398 0.0432 0.0319 0.0241 0.023 ...
$ open : num 0.0363 0.0384 0.0351 0.0222 0.0219 ...
$ volume : num 0.396 0.358 0.675 0.717 0.462 ...
$ adjClose : num 718 719 710 694 696 ...
$ adjHigh : num 722 723 717 702 703 ...
$ adjLow : num 713 717 703 693 692 ...
$ adjOpen : num 716 719 715 699 698 ...
$ adjVolume: int 1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...
dataset before Discretization
#dataset before Discretization
print(dataset)
summary(dataset)
date close high
Min. :2016-06-14 Min. :0.0000 Min. :0.0000
1st Qu.:2017-08-10 1st Qu.:0.2175 1st Qu.:0.2115
Median :2018-09-29 Median :0.3553 Median :0.3532
Mean :2018-10-03 Mean :0.3746 Mean :0.3718
3rd Qu.:2019-11-14 3rd Qu.:0.4736 3rd Qu.:0.4701
Max. :2021-02-02 Max. :1.0000 Max. :1.0000
low open volume
Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.2162 1st Qu.:0.2147 1st Qu.:0.3387
Median :0.3524 Median :0.3554 Median :0.4324
Mean :0.3723 Mean :0.3736 Mean :0.4681
3rd Qu.:0.4698 3rd Qu.:0.4722 3rd Qu.:0.5669
Max. :1.0000 Max. :1.0000 Max. :1.0000
adjClose adjHigh adjLow
Min. : 668.3 Min. : 672.3 Min. : 663.3
1st Qu.: 942.2 1st Qu.: 943.8 1st Qu.: 933.8
Median :1115.7 Median :1125.6 Median :1104.2
Mean :1139.9 Mean :1149.5 Mean :1129.1
3rd Qu.:1264.7 3rd Qu.:1275.7 3rd Qu.:1251.1
Max. :1927.5 Max. :1955.8 Max. :1914.5
adjOpen adjVolume
Min. : 671.0 Min. : 346753
1st Qu.: 939.7 1st Qu.:1167344
Median :1115.8 Median :1394116
Mean :1138.6 Mean :1480717
3rd Qu.:1262.0 3rd Qu.:1719968
Max. :1922.6 Max. :2769225
str(dataset)
'data.frame': 1096 obs. of 11 variables:
$ date : Date, format: "2016-06-14" ...
$ close : num 0.0397 0.0402 0.0334 0.0202 0.022 ...
$ high : num 0.0391 0.0395 0.0346 0.0235 0.0237 ...
$ low : num 0.0398 0.0432 0.0319 0.0241 0.023 ...
$ open : num 0.0363 0.0384 0.0351 0.0222 0.0219 ...
$ volume : num 0.396 0.358 0.675 0.717 0.462 ...
$ adjClose : num 718 719 710 694 696 ...
$ adjHigh : num 722 723 717 702 703 ...
$ adjLow : num 713 717 703 693 692 ...
$ adjOpen : num 716 719 715 699 698 ...
$ adjVolume: int 1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...
We used the discretization technique on our class label “close” to simplify it, as it has a large range of continuous values; we made the values fall into intervals to make them easier to analyze,
and we chose the value 0.2957251 as the mean value for the closing price.
dataset$close <- ifelse(dataset$close <= 0.2957251 , "low","High")
print(dataset)
We discretized it into two categories (low, High) based on the mean: low meaning the value is less than or equal to the mean of the close, and High meaning it is higher than the mean.
Encoding: we encoded the close categories as factors, which helps the model read this data easily.
dataset$close <- factor(dataset$close,levels = c("low", "High"), labels = c("1", "2"))
print(dataset)
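The discretize-then-encode step can be illustrated on a toy vector, using the same threshold of 0.2957251 as above (the values themselves are made up):

```r
# discretize the normalized close at the chosen threshold, then encode
# the two categories as factor levels "1" (low) and "2" (High)
threshold <- 0.2957251
close   <- c(0.0397, 0.3553, 0.2175, 0.4736)         # illustrative values
labels  <- ifelse(close <= threshold, "low", "High")
encoded <- factor(labels, levels = c("low", "High"), labels = c("1", "2"))
as.character(encoded)  # → "1" "2" "1" "2"
```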
dataset after Discretization
#dataset after Discretization
print(dataset)
summary(dataset)
date close high low
Min. :2016-06-14 1:396 Min. :0.0000 Min. :0.0000
1st Qu.:2017-08-10 2:700 1st Qu.:0.2115 1st Qu.:0.2162
Median :2018-09-29 Median :0.3532 Median :0.3524
Mean :2018-10-03 Mean :0.3718 Mean :0.3723
3rd Qu.:2019-11-14 3rd Qu.:0.4701 3rd Qu.:0.4698
Max. :2021-02-02 Max. :1.0000 Max. :1.0000
open volume adjClose
Min. :0.0000 Min. :0.0000 Min. : 668.3
1st Qu.:0.2147 1st Qu.:0.3387 1st Qu.: 942.2
Median :0.3554 Median :0.4324 Median :1115.7
Mean :0.3736 Mean :0.4681 Mean :1139.9
3rd Qu.:0.4722 3rd Qu.:0.5669 3rd Qu.:1264.7
Max. :1.0000 Max. :1.0000 Max. :1927.5
adjHigh adjLow adjOpen
Min. : 672.3 Min. : 663.3 Min. : 671.0
1st Qu.: 943.8 1st Qu.: 933.8 1st Qu.: 939.7
Median :1125.6 Median :1104.2 Median :1115.8
Mean :1149.5 Mean :1129.1 Mean :1138.6
3rd Qu.:1275.7 3rd Qu.:1251.1 3rd Qu.:1262.0
Max. :1955.8 Max. :1914.5 Max. :1922.6
adjVolume
Min. : 346753
1st Qu.:1167344
Median :1394116
Mean :1480717
3rd Qu.:1719968
Max. :2769225
str(dataset)
'data.frame': 1096 obs. of 11 variables:
$ date : Date, format: "2016-06-14" ...
$ close : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ high : num 0.0391 0.0395 0.0346 0.0235 0.0237 ...
$ low : num 0.0398 0.0432 0.0319 0.0241 0.023 ...
$ open : num 0.0363 0.0384 0.0351 0.0222 0.0219 ...
$ volume : num 0.396 0.358 0.675 0.717 0.462 ...
$ adjClose : num 718 719 710 694 696 ...
$ adjHigh : num 722 723 717 702 703 ...
$ adjLow : num 713 717 703 693 692 ...
$ adjOpen : num 716 719 715 699 698 ...
$ adjVolume: int 1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...
Summary after preprocessing: several steps were taken to refine, clean, and prepare the data for stock price prediction analysis and modeling. These preprocessing steps enhance the quality and reliability of the data, supporting more accurate stock price prediction.
dataset after preprocessing
#dataset after preprocessing
print(dataset)
summary(dataset)
date close high low
Min. :2016-06-14 1:396 Min. :0.0000 Min. :0.0000
1st Qu.:2017-08-10 2:700 1st Qu.:0.2115 1st Qu.:0.2162
Median :2018-09-29 Median :0.3532 Median :0.3524
Mean :2018-10-03 Mean :0.3718 Mean :0.3723
3rd Qu.:2019-11-14 3rd Qu.:0.4701 3rd Qu.:0.4698
Max. :2021-02-02 Max. :1.0000 Max. :1.0000
open volume adjClose
Min. :0.0000 Min. :0.0000 Min. : 668.3
1st Qu.:0.2147 1st Qu.:0.3387 1st Qu.: 942.2
Median :0.3554 Median :0.4324 Median :1115.7
Mean :0.3736 Mean :0.4681 Mean :1139.9
3rd Qu.:0.4722 3rd Qu.:0.5669 3rd Qu.:1264.7
Max. :1.0000 Max. :1.0000 Max. :1927.5
adjHigh adjLow adjOpen
Min. : 672.3 Min. : 663.3 Min. : 671.0
1st Qu.: 943.8 1st Qu.: 933.8 1st Qu.: 939.7
Median :1125.6 Median :1104.2 Median :1115.8
Mean :1149.5 Mean :1129.1 Mean :1138.6
3rd Qu.:1275.7 3rd Qu.:1251.1 3rd Qu.:1262.0
Max. :1955.8 Max. :1914.5 Max. :1922.6
adjVolume
Min. : 346753
1st Qu.:1167344
Median :1394116
Mean :1480717
3rd Qu.:1719968
Max. :2769225
str(dataset)
'data.frame': 1096 obs. of 11 variables:
$ date : Date, format: "2016-06-14" ...
$ close : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
$ high : num 0.0391 0.0395 0.0346 0.0235 0.0237 ...
$ low : num 0.0398 0.0432 0.0319 0.0241 0.023 ...
$ open : num 0.0363 0.0384 0.0351 0.0222 0.0219 ...
$ volume : num 0.396 0.358 0.675 0.717 0.462 ...
$ adjClose : num 718 719 710 694 696 ...
$ adjHigh : num 722 723 717 702 703 ...
$ adjLow : num 713 717 703 693 692 ...
$ adjOpen : num 716 719 715 699 698 ...
$ adjVolume: int 1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...
Feature selection
Feature selection is a process of selecting a subset of relevant features (or attributes) from the original set of features in a dataset. The goal of feature selection is to choose the most relevant and important features, thereby reducing dimensionality, and improving model performance.
#Feature selection ,Feature selection using Recursive Feature Elimination or RFE
library(mlbench)
library(caret)
# define the control using a random forest selection function
# number=11 specifies 11-fold cross-validation
control <- rfeControl(functions=rfFuncs, method="cv", number=11)
# run the RFE algorithm with columns 1 to 10 as predictors and column 11 as the target
results <- rfe(dataset[,1:10],dataset[,11], sizes=c(1:10), rfeControl=control)
summarize the results
print(results)
Recursive feature selection
Outer resampling method: Cross-Validated (11 fold)
Resampling performance over subset size:
The top 1 variables (out of 1):
volume
list the chosen features
predictors(results)
[1] "volume"
plot the results
plot(results, type=c("h", "o"))
We applied both supervised and unsupervised learning techniques to our dataset (Google stock prediction), namely classification and clustering. For classification we used a partitioning method, the train-test split, which divides the dataset into two subsets at different ratios, and we implemented three algorithms to build 9 different decision trees.
We chose the attributes with the highest importance (from feature selection) to create each tree.
We divided our dataset into two subsets for each split:
First split 70-30, which means Training (70%) and Testing (30%):
# a fixed random seed to make results reproducible
set.seed(1234)
# 1.Split the datasets into two subsets: Training(70%) and Testing(30%):
ind1 <- sample(2, nrow(dataset), replace=TRUE, prob=c( 0.70, 0.30))
trainData <- dataset[ind1==1,]
testData <- dataset[ind1==2,]
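Note that `sample()` assigns each row independently to the two groups, so the resulting split is only approximately 70/30 rather than exact. A small sketch of this behavior (on a toy row count, not our dataset):

```r
# each row is independently assigned to group 1 (train) or group 2 (test)
# with probabilities 0.70 / 0.30; the split is random, not exact
set.seed(1234)
n   <- 1000
ind <- sample(2, n, replace = TRUE, prob = c(0.70, 0.30))
mean(ind == 1)  # close to 0.70, but the exact value depends on the seed
```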
library(party)
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
Loading required package: sandwich
#myFormula
myFormula <- close ~volume+open+high+low
Information gain is a concept used in machine learning and decision tree algorithms. It measures how effective a particular attribute is at classifying the data and, in the context of decision trees, helps determine the order in which attributes are chosen for splitting.
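As a sketch of the idea (these are the textbook definitions, not code from the party package), entropy and information gain can be computed directly:

```r
# Shannon entropy (base 2) of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  -sum(p * log2(p))
}

# information gain of partitioning the labels y by a logical condition
info_gain <- function(y, cond) {
  n <- length(y)
  groups <- split(y, cond)
  entropy(y) - sum(sapply(groups, function(g) (length(g) / n) * entropy(g)))
}

y    <- c("low", "low", "High", "High")
cond <- c(TRUE, TRUE, FALSE, FALSE)     # a perfect split
info_gain(y, cond)  # → 1: all uncertainty (1 bit) is removed
```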
dataset_ctree <- ctree(myFormula, data=trainData)
table(predict(dataset_ctree), trainData$close)
1 2
1 284 11
2 0 476
# 4.Print and plot the tree:
print(dataset_ctree)
Conditional inference tree with 4 terminal nodes
Response: close
Inputs: volume, open, high, low
Number of observations: 771
1) open <= 0.2974608; criterion = 1, statistic = 423.273
2) high <= 0.2892353; criterion = 1, statistic = 19.817
3)* weights = 267
2) high > 0.2892353
4)* weights = 17
1) open > 0.2974608
5) low <= 0.2955676; criterion = 0.995, statistic = 10.36
6)* weights = 11
5) low > 0.2955676
7)* weights = 476
plot(dataset_ctree, type="simple")
# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
testPred 1 2
1 111 3
2 1 210
# Evaluate the model and create confusion matrix
install.packages("caret")
Error in install.packages : Updating loaded packages
install.packages('e1071', dependencies=TRUE)
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:
https://cran.rstudio.com/bin/windows/Rtools/
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.3/e1071_1.7-13.zip'
Content type 'application/zip' length 653332 bytes (638 KB)
downloaded 638 KB
package ‘e1071’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\shade\AppData\Local\Temp\RtmpIFXKiG\downloaded_packages
library(e1071)
Warning: package ‘e1071’ was built under R version 4.3.2
library(caret)
co_result <- confusionMatrix(result)
print(co_result)
Confusion Matrix and Statistics
testPred 1 2
1 111 3
2 1 210
Accuracy : 0.9877
95% CI : (0.9688, 0.9966)
No Information Rate : 0.6554
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9729
Mcnemar's Test P-Value : 0.6171
Sensitivity : 0.9911
Specificity : 0.9859
Pos Pred Value : 0.9737
Neg Pred Value : 0.9953
Prevalence : 0.3446
Detection Rate : 0.3415
Detection Prevalence : 0.3508
Balanced Accuracy : 0.9885
'Positive' Class : 1
sensitivity(as.table(co_result))
[1] 0.9910714
specificity(as.table(co_result))
[1] 0.9859155
precision(as.table(co_result))
[1] 0.9736842
acc <- co_result$overall["Accuracy"]
acc
Accuracy
0.9876923
The Gini Index is another criterion used in decision tree algorithms, particularly in the context of the Classification and Regression Trees (CART) algorithm. Like information gain, the Gini Index is used to evaluate the impurity or homogeneity of a dataset.
The Gini Index for a specific attribute measures the probability of incorrectly classifying a randomly chosen element in the dataset. A lower Gini Index indicates a purer or more homogeneous set. In the context of decision trees, the attribute with the lowest Gini Index is chosen as the split attribute.
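The Gini impurity is easy to compute directly; a minimal sketch of the definition:

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

gini(c("low", "High"))        # → 0.5: maximally impure for two classes
gini(c("low", "low", "low"))  # → 0: a pure node
```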
# For decision tree model
install.packages("rpart")
library(rpart)
# For data visualization
library(rpart.plot)
Warning: package ‘rpart.plot’ was built under R version 4.3.2
dataset.cart <- rpart(myFormula, data = trainData, method = "class", parms = list(split = "gini"))
Visualizing the unpruned tree
library(rpart.plot)
rpart.plot(dataset.cart)
Checking the order of variable importance
dataset.cart$variable.importance
low high open volume
343.117705 330.102896 326.553402 4.732658
pred.tree = predict(dataset.cart, testData, type = "class")
table(pred.tree,testData$close)
pred.tree 1 2
1 111 3
2 1 210
# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
testPred 1 2
1 111 3
2 1 210
# Evaluate the model and create confusion matrix
library(e1071)
library(caret)
co_result <- confusionMatrix(result)
print(co_result)
Confusion Matrix and Statistics
testPred 1 2
1 111 3
2 1 210
Accuracy : 0.9877
95% CI : (0.9688, 0.9966)
No Information Rate : 0.6554
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9729
Mcnemar's Test P-Value : 0.6171
Sensitivity : 0.9911
Specificity : 0.9859
Pos Pred Value : 0.9737
Neg Pred Value : 0.9953
Prevalence : 0.3446
Detection Rate : 0.3415
Detection Prevalence : 0.3508
Balanced Accuracy : 0.9885
'Positive' Class : 1
sensitivity(as.table(co_result))
[1] 0.9910714
specificity(as.table(co_result))
[1] 0.9859155
precision(as.table(co_result))
[1] 0.9736842
acc <- co_result$overall["Accuracy"]
acc
Accuracy
0.9876923
The Gain Ratio is used to select the attribute that maximizes the Information Gain while avoiding the bias towards attributes with many values. It provides a more balanced measure for attribute selection in decision tree construction.
While Information Gain simply measures the reduction in entropy or uncertainty, Gain Ratio takes into account the intrinsic information of an attribute. It aims to penalize attributes that may have a large number of values, potentially leading to overfitting.
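A minimal sketch of the definition (reusing the textbook entropy formula, not code from the C50 package):

```r
entropy <- function(y) { p <- table(y) / length(y); -sum(p * log2(p)) }

# gain ratio = information gain / split information, where split
# information is the entropy of the branch sizes themselves
gain_ratio <- function(y, cond) {
  n <- length(y)
  groups <- split(y, cond)
  gain <- entropy(y) - sum(sapply(groups, function(g) (length(g) / n) * entropy(g)))
  gain / entropy(cond)
}

y    <- c("low", "low", "High", "High")
cond <- c(TRUE, TRUE, FALSE, FALSE)
gain_ratio(y, cond)  # → 1: gain of 1 bit normalized by split info of 1 bit
```

Dividing by the split information penalizes attributes that fragment the data into many small branches, which is the bias Gain Ratio is designed to avoid.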
install.packages("C50")
install.packages("printr")
library(C50)
Warning: package ‘C50’ was built under R version 4.3.2
library(printr)
Warning: package ‘printr’ was built under R version 4.3.2
Registered S3 method overwritten by 'printr':
method from
knit_print.data.frame rmarkdown
library(caret)
#train using the trainData and create the c5.0 gain ratio tree
CloseTree <- C5.0(myFormula, data=trainData)
summary(CloseTree)
Call:
C5.0.formula(formula = myFormula, data = trainData)
C5.0 [Release 2.07 GPL Edition] Fri Dec 1 06:15:18 2023
-------------------------------
Class specified by attribute `outcome'
Read 771 cases (5 attributes) from undefined.data
Decision tree:
low > 0.2960392: 2 (481/1)
low <= 0.2960392:
:...high <= 0.2892354: 1 (266)
high > 0.2892354:
:...high > 0.3075281: 2 (2)
high <= 0.3075281:
:...open <= 0.278852: 2 (2)
open > 0.278852: 1 (20/3)
Evaluation on training data (771 cases):
Decision Tree
----------------
Size Errors
5 4( 0.5%) <<
(a) (b) <-classified as
---- ----
283 1 (a): class 1
3 484 (b): class 2
Attribute usage:
100.00% low
37.61% high
2.85% open
Time: 0.0 secs
plot(CloseTree)
Second split 60-40, which means Training (60%) and Testing (40%):
# a fixed random seed to make results reproducible
set.seed(1234)
# 1.Split the datasets into two subsets: Training(60%) and Testing(40%):
ind1 <- sample(2, nrow(dataset), replace=TRUE, prob=c(0.60 , 0.40))
trainData <- dataset[ind1==1,]
testData <- dataset[ind1==2,]
library(party)
#myFormula
myFormula <- close ~volume+open+high+low
dataset_ctree <- ctree(myFormula, data=trainData)
table(predict(dataset_ctree), trainData$close)
1 2
1 249 8
2 0 398
# 4.Print and plot the tree:
print(dataset_ctree)
Conditional inference tree with 4 terminal nodes
Response: close
Inputs: volume, open, high, low
Number of observations: 655
1) open <= 0.2974608; criterion = 1, statistic = 363.998
2) high <= 0.2892353; criterion = 0.998, statistic = 11.719
3)* weights = 235
2) high > 0.2892353
4)* weights = 12
1) open > 0.2974608
5) low <= 0.2955676; criterion = 0.987, statistic = 8.71
6)* weights = 10
5) low > 0.2955676
7)* weights = 398
plot(dataset_ctree, type="simple")
# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
testPred 1 2
1 146 6
2 1 288
# Evaluate the model and create confusion matrix
library(e1071)
library(caret)
co_result <- confusionMatrix(result)
print(co_result)
Confusion Matrix and Statistics
testPred 1 2
1 146 6
2 1 288
Accuracy : 0.9841
95% CI : (0.9676, 0.9936)
No Information Rate : 0.6667
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9646
Mcnemar's Test P-Value : 0.1306
Sensitivity : 0.9932
Specificity : 0.9796
Pos Pred Value : 0.9605
Neg Pred Value : 0.9965
Prevalence : 0.3333
Detection Rate : 0.3311
Detection Prevalence : 0.3447
Balanced Accuracy : 0.9864
'Positive' Class : 1
sensitivity(as.table(co_result))
[1] 0.9931973
specificity(as.table(co_result))
[1] 0.9795918
precision(as.table(co_result))
[1] 0.9605263
acc <- co_result$overall["Accuracy"]
acc
Accuracy
0.984127
# For decision tree model
library(rpart)
# For data visualization
library(rpart.plot)
dataset.cart <- rpart(myFormula, data = trainData, method = "class", parms = list(split = "gini"))
Visualizing the unpruned tree
rpart.plot(dataset.cart)
Checking the order of variable importance
dataset.cart$variable.importance
low high open volume
294.972422 284.520643 282.198025 4.645235
pred.tree = predict(dataset.cart, testData, type = "class")
table(pred.tree,testData$close)
pred.tree 1 2
1 146 4
2 1 290
# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
testPred 1 2
1 146 6
2 1 288
# Evaluate the model and create confusion matrix
library(e1071)
library(caret)
co_result <- confusionMatrix(result)
print(co_result)
Confusion Matrix and Statistics
testPred 1 2
1 146 6
2 1 288
Accuracy : 0.9841
95% CI : (0.9676, 0.9936)
No Information Rate : 0.6667
P-Value [Acc > NIR] : <2e-16
Kappa : 0.9646
Mcnemar's Test P-Value : 0.1306
Sensitivity : 0.9932
Specificity : 0.9796
Pos Pred Value : 0.9605
Neg Pred Value : 0.9965
Prevalence : 0.3333
Detection Rate : 0.3311
Detection Prevalence : 0.3447
Balanced Accuracy : 0.9864
'Positive' Class : 1
sensitivity(as.table(co_result))
[1] 0.9931973
specificity(as.table(co_result))
[1] 0.9795918
precision(as.table(co_result))
[1] 0.9605263
acc <- co_result$overall["Accuracy"]
acc
Accuracy
0.984127
library(C50)
library(printr)
library(caret)
#train using the trainData and create the c5.0 gain ratio tree
CloseTree <- C5.0(myFormula, data=trainData)
summary(CloseTree)
Call:
C5.0.formula(formula = myFormula, data = trainData)
C5.0 [Release 2.07 GPL Edition] Fri Dec 1 06:15:19 2023
-------------------------------
Class specified by attribute `outcome'
Read 655 cases (5 attributes) from undefined.data
Decision tree:
low <= 0.2960392: 1 (254/6)
low > 0.2960392: 2 (401/1)
Evaluation on training data (655 cases):
Decision Tree
----------------
Size Errors
2 7( 1.1%) <<
(a) (b) <-classified as
---- ----
248 1 (a): class 1
6 400 (b): class 2
Attribute usage:
100.00% low
Time: 0.0 secs
plot(CloseTree)
Third split 80-20, which means Training (80%) and Testing (20%):
# a fixed random seed to make results reproducible
set.seed(1234)
# 1.Split the datasets into two subsets: Training(80%) and Testing(20%):
ind1 <- sample(2, nrow(dataset), replace=TRUE, prob=c(0.80 , 0.20))
trainData <- dataset[ind1==1,]
testData <- dataset[ind1==2,]
2.Determine the predictor attributes and the class label attribute.( the formula):
library(party)
#myFormula
myFormula <- close ~volume+open+high+low
3.Build a decision tree using training set and check the Prediction:
dataset_ctree <- ctree(myFormula, data=trainData)
table(predict(dataset_ctree), trainData$close)
1 2
1 322 14
2 0 535
# 4.Print and plot the tree:
print(dataset_ctree)
Conditional inference tree with 4 terminal nodes
Response: close
Inputs: volume, open, high, low
Number of observations: 871
1) open <= 0.2974608; criterion = 1, statistic = 478.791
2) high <= 0.2892353; criterion = 1, statistic = 22.684
3)* weights = 303
2) high > 0.2892353
4)* weights = 19
1) open > 0.2974608
5) low <= 0.2997876; criterion = 0.997, statistic = 11.651
6)* weights = 14
5) low > 0.2997876
7)* weights = 535
plot(dataset_ctree, type="simple")
# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
testPred 1 2
1 74 2
2 0 149
# Evaluate the model and create confusion matrix
library(e1071)
library(caret)
co_result <- confusionMatrix(result)
print(co_result)
Confusion Matrix and Statistics
testPred 1 2
1 74 2
2 0 149
Accuracy : 0.9911
95% CI : (0.9683, 0.9989)
No Information Rate : 0.6711
P-Value [Acc > NIR] : <2e-16
Kappa : 0.98
Mcnemar's Test P-Value : 0.4795
Sensitivity : 1.0000
Specificity : 0.9868
Pos Pred Value : 0.9737
Neg Pred Value : 1.0000
Prevalence : 0.3289
Detection Rate : 0.3289
Detection Prevalence : 0.3378
Balanced Accuracy : 0.9934
'Positive' Class : 1
sensitivity(as.table(co_result))
[1] 1
specificity(as.table(co_result))
[1] 0.986755
precision(as.table(co_result))
[1] 0.9736842
acc <- co_result$overall["Accuracy"]
acc
Accuracy
0.9911111
# For decision tree model
library(rpart)
# For data visualization
library(rpart.plot)
dataset.cart <- rpart(myFormula, data = trainData, method = "class", parms = list(split = "gini"))
Visualizing the unpruned tree
library(rpart.plot)
rpart.plot(dataset.cart)
Checking the order of variable importance
dataset.cart$variable.importance
low high open volume
386.324609 371.012963 368.657326 4.711276
pred.tree = predict(dataset.cart, testData, type = "class")
table(pred.tree,testData$close)
pred.tree 1 2
1 74 2
2 0 149
# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
testPred 1 2
1 74 2
2 0 149
# Evaluate the model and create confusion matrix
library(e1071)
library(caret)
co_result <- confusionMatrix(result)
print(co_result)
Confusion Matrix and Statistics
testPred 1 2
1 74 2
2 0 149
Accuracy : 0.9911
95% CI : (0.9683, 0.9989)
No Information Rate : 0.6711
P-Value [Acc > NIR] : <2e-16
Kappa : 0.98
Mcnemar's Test P-Value : 0.4795
Sensitivity : 1.0000
Specificity : 0.9868
Pos Pred Value : 0.9737
Neg Pred Value : 1.0000
Prevalence : 0.3289
Detection Rate : 0.3289
Detection Prevalence : 0.3378
Balanced Accuracy : 0.9934
'Positive' Class : 1
sensitivity(as.table(co_result))
[1] 1
specificity(as.table(co_result))
[1] 0.986755
precision(as.table(co_result))
[1] 0.9736842
acc <- co_result$overall["Accuracy"]
acc
Accuracy
0.9911111
library(C50)
library(printr)
library(caret)
#train using the trainData and create the c5.0 gain ratio tree
CloseTree <- C5.0(myFormula, data=trainData)
summary(CloseTree)
Call:
C5.0.formula(formula = myFormula, data = trainData)
C5.0 [Release 2.07 GPL Edition] Fri Dec 1 06:15:21 2023
-------------------------------
Class specified by attribute `outcome'
Read 871 cases (5 attributes) from undefined.data
Decision tree:
low > 0.2960392:
:...open > 0.3106603: 2 (518)
: open <= 0.3106603:
: :...high <= 0.2916803: 1 (2)
: high > 0.2916803: 2 (23)
low <= 0.2960392:
:...high <= 0.2892354: 1 (302)
high > 0.2892354:
:...open <= 0.278852: 2 (2)
open > 0.278852:
:...high <= 0.3075281: 1 (22/4)
high > 0.3075281: 2 (2)
Evaluation on training data (871 cases):
Decision Tree
----------------
Size Errors
7 4( 0.5%) <<
(a) (b) <-classified as
---- ----
322 (a): class 1
4 545 (b): class 2
Attribute usage:
100.00% low
65.33% open
40.53% high
Time: 0.0 secs
plot(CloseTree)
After applying all three methods, we noticed that for Information Gain and the Gini Index (CART):
the Training (70%) and Testing (30%) split has sensitivity = 0.9959016, specificity = 0.9685039, accuracy = 0.9865229
the Training (60%) and Testing (40%) split has sensitivity = 0.9969512, specificity = 0.9710983, accuracy = 0.988024
the Training (80%) and Testing (20%) split has sensitivity = 0.9940476, specificity = 0.9655172, accuracy = 0.9843137
This means the best split for our dataset is Training (60%) and Testing (40%), because it has the highest sensitivity = 0.9969512 (99.7%), specificity = 0.9710983 (97.1%), and accuracy = 0.988024 (98.8%).
Clustering is unsupervised learning; it does not use a class label to build the clusters. To implement the clusters, we used the k-means algorithm, which produces K clusters, each represented by its center point. It assigns each object to the nearest cluster, then iteratively recalculates the centers and reassigns the objects until the center of each cluster no longer changes, meaning each object is in the right cluster.
The factoextra package is used to help implement the clustering technique. The scale() method is used for scaling and centering the dataset's objects, kmeans() to find a specified number of clusters, fviz_cluster() to visualize the cluster diagram, silhouette() to calculate the average silhouette width for each cluster, fviz_silhouette() to visualize it, and fviz_nbclust() to compare three different numbers of clusters and find the optimal number by evaluating how well the clusters are separated and how compact they are. In both techniques, we used set.seed() with the same random number each time we tried a different size, to ensure we get the same result each time.
Data types should be transformed into numeric types before clustering.
# prepreocessing
#Data types should be transformed into numeric types before clustering.
dataset <- scale(dataset)
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
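The error occurs because `scale()` only accepts numeric input, while the preprocessed dataset still contains a `Date` column and the discretized `close` factor. A hedged fix, shown here on a toy frame that mimics the dataset's structure (values illustrative), is to drop the non-numeric columns first:

```r
# toy frame mimicking the preprocessed dataset: a Date column, the
# discretized close factor, and numeric price columns
dataset <- data.frame(
  date  = as.Date(c("2016-06-14", "2016-06-15", "2016-06-16")),
  close = factor(c("1", "1", "2")),
  high  = c(722, 723, 717),
  low   = c(713, 717, 703)
)

# scale() needs numeric input, so keep only the numeric columns
numeric_cols <- sapply(dataset, is.numeric)
scaled <- scale(dataset[, numeric_cols])
colnames(scaled)  # → "high" "low"
```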
# k-means clustering to find 4 clusters
#set a seed for random number generation to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(dataset, 4)
visualization of 4 clusters
# visualize clustering
#install.packages("factoextra")
library(factoextra)
fviz_cluster(kmeans.result, data = dataset)
average silhouette width for each cluster
#average silhouette for each clusters
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster,dist(dataset))
#a dissimilarity object inheriting from class dist or coercible to one. If not specified, dmatrix must be.
fviz_silhouette(avg_sil)
total within-cluster sum of squares and BCubed precision and recall
# total within-cluster sum of squares
kmeans.result$tot.withinss
[1] 1900.127
#BCubed metric: the average of BCubed precision & recall
library('DPBBM')
c = kmeans.result$cluster
#BCubed_metric(category, cluster, alpha) expects the ground-truth category
#labels as well as the cluster labels; passing only the cluster vector
#raises the error below. Since clustering is unsupervised, our data has no
#ground-truth categories to supply here.
BCubed_metric(kmeans.result$cluster, 0.50)
Error in BCubed_metric(kmeans.result$cluster, 0.5) :
length of category does not comply with length of cluster
Apply k-means clustering for value 3
# run k-means clustering to find 3 clusters
#set a seed for random number generation to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(dataset, 3)
visualization of 3 clusters
# visualize clustering
#install.packages("factoextra")
library(factoextra)
fviz_cluster(kmeans.result, data = dataset)
average silhouette width for each cluster
#average silhouette width for each cluster
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster,dist(dataset))
#a dissimilarity object inheriting from class dist or coercible to one. If not specified, dmatrix must be.
fviz_silhouette(avg_sil)
total within-cluster sum of squares and BCubed precision and recall
# total within-cluster sum of squares
kmeans.result$tot.withinss
[1] 2908.955
#BCubed metric: the average of BCubed precision & recall (fails as above:
#no ground-truth category labels are available)
library('DPBBM')
c = kmeans.result$cluster
BCubed_metric(kmeans.result$cluster, 0.6)
Error in BCubed_metric(kmeans.result$cluster, 0.6) :
length of category does not comply with length of cluster
Apply k-means clustering for value 2
# run k-means clustering to find 2 clusters
#set a seed for random number generation to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(dataset, 2)
visualization of 2 clusters
# visualize clustering
#install.packages("factoextra")
library(factoextra)
fviz_cluster(kmeans.result, data = dataset)
average silhouette width for each cluster
#average silhouette width for each cluster
library(cluster)
avg_sil <- silhouette(kmeans.result$cluster,dist(dataset))
#a dissimilarity object inheriting from class dist or coercible to one. If not specified, dmatrix must be.
fviz_silhouette(avg_sil)
total within-cluster sum of squares and BCubed precision and recall
# total within-cluster sum of squares
kmeans.result$tot.withinss
[1] 4126
#BCubed metric: the average of BCubed precision & recall (fails as above:
#no ground-truth category labels are available)
library('DPBBM')
c = kmeans.result$cluster
BCubed_metric(kmeans.result$cluster, 0.6)
Error in BCubed_metric(kmeans.result$cluster, 0.6) :
length of category does not comply with length of cluster
kmeansruns() calls kmeans() to perform k-means clustering. It initializes the k-means algorithm several times with random points from the data set as means, and it estimates the number of clusters by the Calinski-Harabasz index or the average silhouette width.
install.packages("fpc")
library(fpc)
#kmeansruns() : It calls kmeans() to perform k-means clustering
#It initializes the k-means algorithm several times with random points from the data set as means.
#It estimates the number of clusters by index or average silhouette width
kmeansruns.result <- kmeansruns(dataset)
kmeansruns.result
K-means clustering with 4 clusters of sizes 290, 216, 408, 182
Cluster means:
open volume adjClose adjHigh adjLow
1 -1.10632007 -0.4407852 -1.10386041 -1.10816456 -1.10089108
2 1.58007963 0.2687019 1.58272008 1.58333842 1.58085838
3 0.06915076 -0.5133559 0.06973682 0.06332081 0.07637218
4 -0.26746094 1.5342709 -0.27582770 -0.25532015 -0.29322444
adjOpen adjVolume
1 -1.10632007 -0.4407852
2 1.58007963 0.2687019
3 0.06915076 -0.5133559
4 -0.26746094 1.5342709
Clustering vector:
(per-object cluster assignments omitted for brevity)
Within cluster sum of squares by cluster:
[1] 384.3763 663.7649 449.0264 402.9592
(between_SS / total_SS = 75.2 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault" "crit" "bestk"
fviz_cluster(kmeansruns.result, data = dataset)
k-medoids clustering with PAM
#install.packages("cluster")
library(cluster)
# group into 4 clusters
pam.result <- pam(dataset, 4)
plot(pam.result)
Hierarchical clustering: draw a sample of 40 records from the dataset, so that the clustering plot will not be overcrowded
##----Hierarchical Clustering of the Data-----##
set.seed(2835)
# draw a sample of 40 records from the dataset data, so that the clustering plot will not be over crowded
idx <- sample(1:dim(dataset)[1], 40)
dataset2 <- dataset[idx, ]
## hierarchical clustering
library(factoextra)
hc.cut <- hcut(dataset2, k = 2, hc_method = "complete") # Computes Hierarchical Clustering and Cut the Tree
# Visualize dendrogram
fviz_dend(hc.cut,rect = TRUE) #logical value specifying whether to add a rectangle around groups.
# Visualize cluster
fviz_cluster(hc.cut, ellipse.type = "convex") # Character specifying frame type. Possible values are 'convex', 'confidence' etc
define function to compute average silhouette for k clusters using silhouette()
silhouette_score <- function(k){
km <- kmeans(dataset, centers = k, nstart = 25) # nstart: how many random initial center sets to try
ss <- silhouette(km$cluster, dist(dataset))
sil <- mean(ss[, 3]) # average silhouette width over all objects
return(sil)
}
# k cluster range from 2 to 10
k <- 2:10
## call the function for each k value
avg_sil <- sapply(k, silhouette_score) ##Apply a Function over a List or Vector
plot(k, avg_sil, type='b', xlab='Number of clusters', ylab='Average Silhouette Scores', frame=FALSE)
silhouette method
After trying the 3 values of k, and based on the plots above, we noticed that the best size is k = 2: it partitions the data better than the other values.
Our dataset represents the opening and closing prices of Google stock in the market. Our goal was to predict higher closing prices that indicate a positive trend in Google stock. To obtain the best, most accurate, and most precise results, we used several data mining preprocessing techniques that improve the quality of the data. Several plotting methods were applied to help us understand our data. Based on the plots we removed outliers; we did not find any null or missing values. Then data transformation, such as normalization and discretization, was applied to transform attribute values.
Then we applied the data mining tasks: classification and clustering. For classification, we used the decision tree method to construct our model; 3 different splits of training and testing data were used to get the best result for construction and evaluation. The results for the different splits were: 70% training / 30% test, accuracy = 0.9865229; 60% training / 40% test, accuracy = 0.988024; 80% training / 20% test, accuracy = 0.9843137. In conclusion, the most accurate model was the second, with 60% training and 40% test data, which means that most of its tuples were correctly classified.
For clustering, 3 different values of K were used in the K-means algorithm to find the optimal number of clusters. The average silhouette width for each K was calculated to reach the results shown.
Since the highest average silhouette width occurs when the number of clusters equals 2, that is the optimal number of clusters. The higher the average silhouette width, the closer the objects within the same cluster are to each other, and the farther they are from the objects in the other clusters.
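To illustrate this interpretation, a small sketch on synthetic 2-D data (illustrative only, not our stock dataset): two well-separated blobs score a higher average silhouette width with k = 2 than with k = 3, because forcing a third cluster splits one true group.

```r
library(cluster)
set.seed(1)
# two well-separated synthetic blobs of 50 points each
blob <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
              matrix(rnorm(100, mean = 5), ncol = 2))
avg_sil_k <- function(k) {
  km <- kmeans(blob, centers = k, nstart = 25)
  mean(silhouette(km$cluster, dist(blob))[, 3])
}
avg_sil_k(2)  # high: clusters are compact and well separated
avg_sil_k(3)  # lower: one true cluster is split artificially
```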
In the end, both models are helpful and aided us in predicting. But since our dataset is numeric, after performing both the clustering and the classification we noticed that clustering fits the dataset better, because its concept is built around numeric data.